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Summary. For high-dimensional classification, it is well known that naively performing the Fisher discriminant 
rule leads to poor results due to diverging spectra and noise accumulation. Therefore, researchers proposed 
independence rules to circumvent the diverging spectra, and sparse independence rules to mitigate the issue 
of noise accumulation. However, in biological applications, there are often a group of correlated genes respon- 
sible for clinical outcomes, and the use of the covariance information can significantly reduce misclassification 
rates. In theory the extent of such error rate reductions is unveiled by comparing the misclassification rates 
of the Fisher discriminant rule and the independence rule. To materialize the gain based on finite samples, 
a Regularized Optimal Affine Discriminant (ROAD) is proposed. ROAD selects an increasing number of fea- 
tures as the regularization relaxes. Further benefits can be achieved when a screening method is employed 
to narrow the feature pool before hitting the ROAD. An efficient Constrained Coordinate Descent algorithm 
(CCD) is also developed to solve the associated optimization problems. Sampling properties of oracle type 
are established. Simulation studies and real data analysis support our theoretical results and demonstrate the 
advantages of the new classification procedure under a variety of correlation structures. A delicate result on 
continuous piecewise linear solution path for the ROAD optimization problem at the population level justifies 
the linear interpolation of the CCD algorithm. 

Keywords: High Dimensional Classification, LDA, Regularized Optimal Affine Discriminant, Fisher Discrimi- 
nant, Independence Rule. 

1. Introduction 



Technological innovations have had deep impact on society and on various areas of scientific research. 
High-throughput data from microarray and proteomics technologies are frequently used in many contem- 
porary statistical studies. In the case of microarray data, the dimensionality is frequently in thousands 
or beyond, while the sample size is typically in the o rder of tens. The large- jp-small-n scenario poses 
challenges for the classification problems. We refer to iFan and Lv (20101 ) for an overview of statistical 
challenges associated with high dimensionality. 

When the feature space dimension p is very high compared to the s ample size n, the Fisher d iscrimi- 
nant rule performs poorly due to diverging spectra as demonstrated bv lBickel and Levina (20041 ) . These 
authors showed that the independence rule in which the covariance s tructure is ignored p erforms better 
than the naive Fisher rule (NFR) in the high dimensional setting. IFan and Fan (20081 ) demonstrated 
further that even for the independence rules, a procedure using all the features can be as poor as random 
guessing du e to noise accumulat ion in estimating population centroids in high-dimensional feature space. 
As a result, IFan and Fan (20081) proposed the Fe atures Annealed Ind ependence Rule (FAIR) that selects 



a subset of important features for classification. iDudoit et al. (20021 ) r eported that for microa rray data, 
ignoring correlations between genes leads to better classification results. iTibshirani et al. (20021 ) proposed 
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the Nearest Shrunken Centroid (NSC) which likewise employs the workin g independence structure. Sim 



ilar problems are also studied in the machine learning community such as lDomingos and Pazzani ("19971 ) 



and 



Lewis (19981 ) 



In microarray studies, correlation among different genes is an essential characteristic of the data and 
usually not negligible. Other examples include proteomics, and me tabolomics data where correlatio n 



among biomarkers is commonplace. More details can be found in lAckermann and Strimmer (20091 ). 
Intuitively, the independence assumption among genes leads to loss of critical information and hence is 
suboptimal. We believe that in many cases, the crucial point is not whether to consider correlations, 
but how we can incorporate the covariance structure into the analysis with a bullet proof vest against 
diverging spectra and significant noise accumulation effect. 

The setup of the objective classification problem is now introduced. We assume in the following that 
the variability of data under consideration can be described reasonably well by the means and variances. 
To be more precise, suppose that random variables representing two classes C\ and C2 follow p-variatc 
normal distributions: X|Y = 1 ~ A^/z^E) and X|F = 2 ~ A/" p (/x 2 ,S) respectively. Moreover, assume 
P(y = 1) = 1/2. This Gaussian discriminant analysis setup is known for its good performance despite 
its rigid model structure. For any linear discriminant rule 

MX) = i{w T (x - n a ) > 0}, (1) 

where fi a = (fi 2 + A*i)/2j and I denotes the indicator function with value 1 corresponds to assigning X 
to class C2 and class C±, the misclassification rate of the (pseudo) classifier (5 W is 

W(K) = i/MMX) = 0) + iPi(MX) = 1) = 1 - ^wV./fw^w) 1 / 2 ), (2) 

where \± d = (fj, 2 — /x 1 )/2, and Pi is the conditional distribution of X given its class label i. We will focus 
on such linear classifier <5 W (-), and the mission is to find a good data projection direction w. Note that 
the Fisher discriminant 

MX) =!{(£- V d ) T (X-/x Q )>0} (3) 

is the Bayes rule. There are two fundamental difficulties in applying the Fisher discriminant whose 
missclassification rate is 

1 - $ (WIT V d ) 1/2 ) • (4) 
The first difficulty arises from the noise accumulation effect in estimating the population centroids 



([Fan and Fan. 20081 ) when p is l arge. The second challen ge is more severe: estimating the inverse of 



covariance matrix S when p > n (jBickel and Levina. 20041 ). As a result, much previous researches focus 



on the independence rules, which act as if £ is diagonal. However, correlation matters! 

To illustrate this point, consider a case when p = 2. These two features can be selected from the 
original thousands of features, and we can estimate the correlation between two variables with reasonable 
accuracy. Let 

-(::)• 

where p £ [0, 1) and fi d = (/ii,/i2) T . Without loss of generality, assume \fj,i\ > \n%\ > 0. The misclassifi- 
cation rate of Fisher discriminant depends on 

\(P) = Md^-Vd = T^-^M? + t4 - ipiWMt)- (5) 
1 — p z 
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Note that 



A' p (p) > /iip 2 p 2 - (/i? + + /n/i 2 < 0. 



Therefore, when p\p2 < 0, Ap(p) > for all p G [0,1). On the other hand, when ^1^2 > 0, A p (p) 
decreases on p <E (0, j^-), and increases on (^-, 1). Notice that when p — > 1, A p — > 00 regardless of signs 
for pip.2, which in turn leads to vanishing classification error. On the other hand, if we use independence 
rule (also called naive Bayes rule), the optimal misclassification rate 



iMdlli 



(6) 



depends on T(p) = which is monotonically decreasing for p g [0,1), with the limit 

(/if + p.?) 2 1 (Mi + Z^) 4 that is smaller than unity when p\ and pi have the same sign. Hence, the optimal 
classification error using the independence rule actually increases as correlation among features increases. 

The above simple example shows that by incorporating correlation information, the gain in terms of 
classification error can be substantial. Elaboration on this point in more realistic scenarios is provided in 
Section 2. Now it seems wise to use at least a part of covariance structure to improve the performance of 
a classifier. So there is a need to estimate the covariance matrix S. Without structural assumptions on 
£, the pooled sample covariance £ is one natural estimate. But for p > n, it is not considered as a good 
estimate of X in general. We are lucky here because our mission is not constructing a good estimate 
of the covariance matrix, but finding a good direction w that leads to a good classifier. To mimic the 
optimal data projection direction S _1 // d , we do not adopt a direct plug-in approach, simply because it 
is unlikely that a product is a good estimate when at least one of its components is not. Instead, we 
find the data projection direction w by directly minimizing the classification error subject to a capacity 
constraint on w. From a broad spectrum of simulated and real data analysis, we are convinced that this 
approach leads to a robust and efficient sparse linear classifier. 

Admittedly , our work is far from the first to use covariance for classification; support vector machines 
i Vapnik. 19 95). for example, implicitly utilize covariance between covariates. Anoth er notable work is 



"shrunken centroids regularized discriminant analysis" (SCRDA) (|Guo et al.. 20051 ), which calls for a 



version of regularized sample covariance matrix S rog , and soft-thresholds on S rog Xi. iShao et al. (20111 ) 
consider a sparse linear discriminant analysis, assuming the sparsity on both the covariance matrix and 
the mean difference vector so that they can be regularized. They show that such a regularized estimator 
is asymptotically optimal under some conditions. However, to the best of our knowledge, this work is the 
first to select features by directly optimizing the misclassification rates, to explicitly use un-regularized 
sample covariance information, and to establish the oracle inequality and risk approximation theory. 
There i s a huge literature on h igh dimensional classification. Examples i nclude principal componen t 



Huang (2003) and 



3) 



analysis in iBair et al. (2009) and IZou et al. (200® ) , partial least squares in iNguven and Rocke (20021) 



Boulesteix (20041 ). and sliced inverse regression in lLi (19911) and lAntoniadis et al. (20031 ) 



The rest of our paper is organized as follows. Section [5] provides some insights on the performances of 
naive Bayes, Fisher discriminant and restricted Fisher discriminants. In Section[31 we propose the Regu- 
larized Optimal Affinc Discriminant (ROAD) and variants of ROAD. An efficient algorithm Constrained 
Coordinate Descent (CCD) is constructed in Section [4] Main risk approximation results and continuous 
piecewise linear property of the solution path are established in Section [SJ We conduct simulation and 
empirical studies in Section [SJ A discussion is given in Section [71 and all proofs are relegated to the 
appendix. 
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2. Naive Bayes and Fisher Discriminant 



To compare the naive Bayes and Fisher discriminant at the population level, we assume without loss of 
generality that variables have been marginally standardized so that S is a correlation matrix. Recall 
that the naive Bayes discriminant has error rate ([5]) and the Fisher discriminant has error rate @. Let 
T p = ||/x d |||//iJS/i d and A p = Denote by {A 2 }f =1 the eigenvalues and {£j}f = i eigenvectors 

of the matrix S. Decompose 

Md = a i£i + ---+ a p£ P > ( 7 ) 

where {ai}f =1 are the coefficients of fi d in this new orthonormal basis {£j}f = i- Using the decomposition 
dTj) , we have 

a p = E °j Ai. t p = (E a f)i E v*- ( g ) 

3=1 3=1 3=1 

The relative efficiency of Fisher discriminant over naive Bayes is characterized by A p /T p . By the Cauchy- 
Schwartz inequality, 

A P /r p > i. 

The naive Bayes method performs as well as the Fisher discriminant only when n d is an eigenvector of 
S. 

In general, A p /T p can be much larger than unity. Since S is the correlation matrix, X^=i Aj = 
tr(S) = p. If fi d is equally loaded on £ ■, then the ratio 

A P /r P = p- 2 E A3 E v 1 = E v 1 - ( 9 ) 

3 = 1 3=1 3 = 1 

More generally, if {aj}^ =1 are realizations from a distribution with the second moment a 2 , then by the 
law of large numbers, 

E^^EVA* ^Es 2 -^ Ea.';--Ea, 

3 = 1 3=1 3 = 1 3 = 1 3=1 

Hence, © holds approximately in this case. In other words, the right hand side of © is approximately the 
relative efficiency of the Fisher discriminant over the naive Bayes. Now suppose further that half of the 
eigenvalues of S are c and the other half are 2 — c. Then, the right hand side of © is (c _1 + (2 — c) _1 )/2. 
For example when the condition number is 10, this ratio is about 3. A high ratio translates into a large 
difference in error rates: 1 — $(r^/ 2 ) for independence rule is much larger than 1 — <I>(3ry 2 ) for Fisher 

1 /2 

discriminant. For example, when T p = 0.5, we have 30.9% and 6.7% error rates respectively for the 
naive Bayes and Fisher discriminant. 

To put the above arguments under a visual inspection, consider a case in which p = 1000, fi d = 
(fi^,0 T ) T with /x s = (1, 1, 1, 1, 1, 2, 2, 2, 2, 2) T and S equals the equi-correlation matrix with pairwise 
correlation p. The vector /j, d simulates the case in which 10 genes out of 1000 express mean differences. 
Figurc[T]dcpicts the theoretical error rates of the Fisher discriminant and the naive Bayes rule as functions 
of p. 

It is not surprising that the Fisher discriminant rule performs significantly better than the naive Bayes 
as p deviates away from zero. The error rate of the naive Bayes actually increases with p, whereas the 
error rate of the Fisher discriminant tends to zero as p approaches 1. This phenomenon is the same as 
what was shown analytically through the toy example in Section 1. To mimic Fisher discriminant by a 
plug-in estimator, we need to estimate XP 1 (i d with reasonable accuracy. This mission is difficult if not 
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Fig. 1. Misclassification rates of Fisher discriminant, naive Bayes and restricted Fisher rules (10 and 20 features, 
respectively) against p. 

impossible. On the other hand, imitating a weaker oracle is more manageable. For example, when the 
samples are of reasonable size, we can select the 10 variables with differences in means by applying a 
two-sample t-tcst. Restricting to the best linear classifiers based on these s = 10 variables, we have the 
optimal error rate 



is depicted by the sub-Fisher (10 features) in Figure [TJ It performs much better than the naive Bayes 
method. One can also employ the naive Bayes rule to the restricted feature space, but this method has 
exactly the same performance as the naive Bayes method in the whole space. Thus, the restricted Fisher 
discriminant outperforms both the naive Bayes method with restricted features and the naive Bayes 
method using all features. 

Mimicking the performance of the restricted Fisher discriminant is feasible. Instead of estimating a 
1000 x 1000 covariance matrix, we only need to gauge a 10 x 10 submatrix. However, this restricted Fisher 
rule is not powerful enough, as shown in Figure 1. We can improve its performance by including 10 most 
correlated variables to each of those selected features to further account for the correlation effect, giving 
rise to a 20-dimcnsional feature space. Since the variables arc equally correlated in this example, we are 
free to choose any 10 variables among the other 990. The performance of such an enlarged restricted 
Fisher discriminant is represented by sub-Fisher (20 features) in FigurcQ] It performs closely to the Fisher 
discriminant which uses the whole feature space, and it is feasible to implement with finite samples. 

3. Regularized Optimal Affine Discriminant 

The misclassification rate of Fisher discriminant is 1 — Q(A P ^ 2 ), where A p = /xJS _1 /x d . However, for 
high dimensional data, it is impossible to achieve such a performance empirically. Among other reasons, 
the estimated covariance matrix S is ill-conditioned or not invertible. One solution is to focus only on 
the s(<< p) most important features for classification. Ideally, the best s features should be the ones 
with the largest A s among all (^) possibilities, where A s is the counterpart of A p when only s variables 



where the classification rule is 5 m n and w 
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are considered. Naive search for the best subset of size s is NP-hard. Thus, we develop a regularized 
method to circumvent these two problems. 



3.1. ROAD 

Recall that by ©, minimizing the classification error W(5 W ) is the same as maximizing w T /i. d / (w T Sw) 1 / 2 , 
which is equivalent to minimizing w T Sw subject to w T fi d = 1. We would like to add a penalty function 
for capacity cont rol. There are ma ny ways to do regularizatio n; for the lite rature on penalized m ethods, 
refer to LASS O (jTibshirani. 19961) . S CAD (Fan and Li. 20011). Ela stic net (|Zou and Hastie. 20051) . MCP 
(jZhang. 20101 ) and related methods (|Zou. 2006c IZou and Li. 20081 ) . As our primary interest is classifi- 
cation error (the risk of the procedure), an Li constraint ||w||i < c is added for regularization, so the 
problem can be recast as 

w c = argmin w T Sw. (10) 

||w||i<c,w T /i d = l 

We name the classifier <5 Wc (-) the Regularized Optimal Affinc Discriminant (ROAD). The existence of a 
feasible solution in ([TP]) dictates 

c > 1/ max (11) 

l<i<p 

When c is small, we obtain a sparse solution and achieve feature selection using covariance information. 
When c > j|S _1 /x d ||i//xJS _1 /x d , the L\ constraint is no longer binding and i5 Wc reduces to the Fisher 
discriminant, which can be denoted by <5 Woo (= 5p). Therefore we have provided a family of linear 
discriminants, indexed by c, using from only one feature to all features. In some applications such as 
portfolio selection, the choice of c reflects the investor's tolerance upper bound on gross exposure. In 
other applications, when the user does not have a such a preference, the choice of c can be data-driven. 
To accommodate both application scenarios, we propose a coordinate descent algorithm (Section [J| to 
implement our ROAD proposal. 



3.2. Variants of ROAD 

At the sample level, NSC (jTibshirani et ai. 20021 ) and FAIR (|Fan and Fan. 20081) both use shrunken 
versions of standardized mean difference to find the s features. In the same spirit, we consider the 
following Diagonal Regularized Optimal Affinc Discriminant(D-ROAD) 5 w i, where 



argmin w diag(X)w. 

||w||i<c,w' r /i (1 = l 



(12) 



The D-ROAD will be compared with NSC (jTibshirani et ai. 20021) and FAIR (|Fan and Fan. 20081) in 
the simulation studies, and all these independence based rules will be compared with ROAD and its two 
variants defined below. 

A screening-based variant (to be proposed) of ROAD aims at mimicking the performance of sub-Fisher 
(10 features) in Figure 1. A fast way to select features is the independence screening, which uses the 
marginal information such as the two-sample t-test. We can also enlarge the selected feature subspace by 
incorporating the features which are most correlated to what have been chosen. This additional variant 
of ROAD tracks the performance of sub-Fisher (20 features) in Figure 1. We will refer to the two variants 
of ROAD as S-ROAD1 and S-ROAD2. More description of these procedures, along with their theoretical 
properties and numerical investigations, will be detailed in Sections 5 and 6. 

A hint of the rationale behind including correlated features that do not show a difference in means 
between the two classes, is revealed through the two-feature example in the introduction. Suppose fj,2 = 0. 
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Then, by ([5]), the power of the discriminant using two features is 1 — $(A^ 2 ) where A 2 = /U?/(l — p 2 ), 
whereas with the first feature alone the misclassification rate is 1 — $(A^ 2 ) where Ai = /1 2 . Therefore 
when the correlation \p\ is large, using two correlated features is far more powerful than employing only 
one feature, even though the second feature has no marginal discrimination power. More intuition is 
granted by this observation: at the population level, the best s features are not necessarily those with 
largest standardized mean differences. In other words, with the two class Gaussian model in mind, 
when S is the correlation matrix, the most powerful s features for classification are not necessarily the 
coordinates of fi d with largest absolute values. This is illustrated by the next stylized example. 

Let X|y = ~ Afin-L, S) and X\Y = 1 - J\f(fi 2 , £), where ^1 = (0, 0, 0) T , fi 2 = (4, 0.5, 1) T , and 





f 1 


-0.25 


o\ 


£ = 


-0.25 


1 







V 





1 1 



Suppose the objective is to choose 2 out of 3 variables for classification. If we rank features by marginal 
information, for example by the absolute value of standardized mean differences, then we would choose 
the 1st and 3rd features. On the other hand, denote fi d y the mean difference vector for features i and j, 
Tnj the covariance matrix of features i and j, then the classification power using features i and j depends 
on Tij = iijjjU^j Simple calculation leads to 

r 12 = 18.4> 17 = r 13 . 

Hence the most powerful two features for classification are not the 1st and 3rd. 



4. Constrained Coordinate Descent 



With a Lagrangian argument, we reformulate problem (|10[) as 



wa = argmin -w T Sw + A||w||i. (13) 

In this section, we propose a Constrained Coordinate Descent (CCD) algorithm that is tailored for 
solving our minimization problem with linear constraints. Optimization (|13[) is a constrained quadratic 
programming problem and can be solved by existing softwares such as MOSEK. Although these softwares 
are well regarded in practice, they are slow for our application. The structure of (fl3|) could be exploited 
in order to obtain a more efficient algorithm. In line with the LARS algorithm, we will exploit the fact 
that the solution path has a piecewise-linear property. 

In the compressed sensing literature, it is common to replace an affine constraint by a quadratic 
penalty. We borrow this idea and consider the following approximation to (fT3)) : 



wj j7 = argmin -w T Ew + A||w||i + — 7(w T fj, d — l) 2 . (14) 
In practice, we replace £ by the pooled sam ple covariance £ and fi by the sample mean difference vector 



fx d . By Theorem 6.7 in iRuszczvnski (20061 ). we have 



wa, 7 — > when 7 — > 00. 



Note that we do not have to enforce the affine constraint strictly, because it only serves to normalize our 
problem. In the optimization problem ([Li]) , when A = 0, the solution Wo, 7 is always in the direction of 
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E -1 fj, d . the Fisher discriminant, regardless of the value of 7. In addition, this observation is confirmed 
in the data analysis (Section 6.2) by the insensitivity of choice for 7. Therefore we hold 7 as a constant 
in practice. 

We solve (TTJ} by coordinate descent. Non-gradient algorithms seem to be less popular for convex op - 



timization. For instance, the popular textbook Convex Optimization bv lBovd and Vandenberghe (20041 ) 
does not even have a section on these methods. Coordinate descent method is an algorithm, in which 
the p search directions are just unit vectors e\, ■ ■ ■ ,e p , where et denotes the ith element in the standard 
basis of MP . These unit vectors are used as search directions in each search cycle until some convergence 
criterion is met. 

What makes the coordinate descent algorithm particularly attractive for (|14[) is that there is an explicit 
formula for each coordinate update. For a given 7, fix r and K , then do the optimization on a grid (of 
log-scale) of A values: TA max = A^ < \k-i < • • • < Ai = A max . The A max is the minimum A value such 
that no variables enter the model; this is analogous to the minimum requirement on c in (jlip . In our 
implementation, we take r = 0.001 and K = 100. The problem is solved backwards from A max . When 
A = Aj+i, we use the solution from A = Ai as the initial value. This kind of "warm start" is very effective 
in improving computational efficiency. 

Consider a coordinate descent step to solve ([HI). Without loss of generality, suppose that Wj for all 
j > 2 are given, and we need to optimize with respect to w\. The objective function now becomes 



glvx) = \ (< wj) H K) + am + A|* a |x + i 7 (wV, - 1; 



2 



When w\ 7^ 0, we have 



g'{wx) = Snwi + S12W2 + A sign(wi) + 7(w T /x d - l)/j, dl 

= (En + jfM 2 dl )wi + (E12 + 7/i rf i/xJ 2 )w 2 + A sign(wi) - 7/x di . 



By simple calculation (jDonoho and Johnstone. 19941 ) . the coordinate- wise update has the form 



S (7/Xdi - (S12 + 7MdiMl 2 ) w 2, A) 

Wl = v — ; — 2 ' 

where S(z, A) = sign(z)(|z| — A) + is the soft-thresholding operator. 

Now, we consider the convergence property of the coordinate descent algorithm. Here, although the 
objective function is not strictly convex, it is strictly convex in each of the coordinates. 

To show g(w\) is strictly convex in w\, we decompose it as follows: 



5(^1) = Si +52(101), 



where 52(101) = A|wi| and 



Note that g\[w-y) is a quadratic function of W\ and g"(wi) = En + 7/i^ > for all Wi G M. Therefore, 
the function gi(-) is strictly convex on K. Also, it is clear that 32 is convex on R. Therefore g = g± + g-2 
is a strictly convex function on K. 

Combining the coordinate- wise strict con vexity with th e fact that the non-differentiable part of the ob- 



jective function is separable, Theorem 5.1 of lTseng (20011 ) guarantees that coordinate descent algorithms 
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converge to coordinate-wise minima. Moreover, since all directional derivatives exist, every coordinate- 
wise minimum is also a lo cal minimum. A similar st udy on the convergence of the coordinate descent 
algorithm can be found in iBrehenv and Huang f201lh . 

In each coordinate update, the computational complexity is 0(p). A complete cycle through all p 
variables costs 0(p 2 ) operations. From our experience, CCD converges quickly after a few cycles if "warm 
start" is used for the initial solution. Let C denote the average number of cycles until convergence for 
each A. Then our algorithm CCD enjoys computational complexity 0(CKp 2 ). The D-ROAD can be 
similarly implemented by replacing the covariance matrix with its diagonal. 



5. Asymptotic Property 

5. 1 . Risk Approximation 

Let w c be a sample version of w c in (|10[) , 

w c G argmin w T Sw. (15) 

||w||i<c,w T /j, d = l 

The fact that S is only positive semi-definite leads to potential non-uniqueness of w c . Now, we have three 
different classifiers: 5 w=o = I{w^ (X-/zJ > 0}, c5 Wc = I{wf (X-/iJ > 0} and <5 Wc = I{wJ(X-/i ) > 0}. 
The first two are oracle classifiers, requiring the knowledge of unknown parameters /ti x , fi 2 an d S, while 
the third one is the feasible classifier, ROAD, based on the sample. Their classification errors are given 
by ([2|). Explicitly, the error rates are respectively W(<5 Woo ) [see fl3J|], W(S v/c ), and W(<5 Wc ). By ([2]), an 
obvious estimator of the misclassification rate of <5 Wc is 

w n (L c ) = i - <f f TT^bd • ( 16 ) 



(W^Sw c )i/2 



Two questions arise naturally: 



(i) how close is W(S Wc ), the misclassification error of <5 Wc , to that of its oracle W(£ Wc )? 

(ii) does W n {8 VIc ) estimate W(5 VJc ) well? 

Theorem [T] addresses these two questions. We introduce an intermediate optimization problem for con- 
venience: 

w^ 1 -* = argmin w T Sw. 

|w||i<c.w T /i d = l 

Theorem 1. Let s c = ||w c || . = Iw^^lo, and s c = ||w c || . Assume that A m i„(S) > a 2 > 0, 
Op(a n ) and \\fi d — /^^lloo = O p (a n ) for a given sequence a n — > 0. Then, we have 



W(6 w J-W(6 Wc ) = O p (d n ), 



and 



w n (L c )-w(L c ) = o p (b n ), 

where b„ = i^c 2 V s c V s^ 1 ^ a n and d„ = b„ V (s c a„). 

Remark 1. In Theorem ]]^ || • ||oo is the element wise super-norm. When S is the sample covariance, 



b mBickel and Levina (200A ), ||S — S||oo = O p (y/ (log p)/n); hence we can take a n = (log p)/n. The 
first result in Theorem Q] shows the difference between the misclassification rate of (5 We and its oracle 
version <5 Wc ; the second result says about the error in estimating the true misclassification rate of ROAD. 
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Remark 2. In view of one intends to choose a w that makes w T Sw small and w T fi d large. A 
compromise of these dual objectives leads to a utility function 



U(w) = -w T £w + £/xJw, 

as a proxy of the objective function for a fixed £. For any £ > 0, the optimal choice w* £ argmin U(w) 
leads to the Fisher discriminant rule. Consider also the regularized versions 

w* = argmin j | w || i < c C/(w), and w* = argmin|| w y I < c C/'(w), 

where t/(w) is the utility function with S and fi d estimated by £ and fi d . Then, it is easy to see the 
following utility approximation: for any ||w||i < c 

|C/(W) - #(w)| < ||± - n\ooC 2 + ^cllAd - Mdlloo 

and 

|[/(w c *) - C/(w*)| < 2(||± - S|Uc 2 + - Mdl 

Remark 3. T/ie mosi prominent technical challenge of our original problem M0\) is due to different 
dualities of penalization problems. For the population version \10\) . it can be reduced, by the Lagrange 
multiplier method, to the utility U (w) optimization problem in Remark 2 with a given £ > 0, while for 
the sample version i U<5|) , it can be reduced to the utility U(mv) optimization problem with a different £. 
Therefore, the problem is not the same as the utility optimization problem in Remark 2: £ is hard to 
bound. In fact, it is much harder and yields more complicated results. 

We now show how different the data projection direction in the regularized oracle can be from that in 
the Fisher discriminant. To gain better insight, we reformulate the L\ constraint problem as the following 
penalized version: 

w A = argmin w T Sw + A||w||i. (17) 

w:^tjw— 1 

The following characterizes its convergence to the Fisher discriminant weight w M as A — > 0. 
Theorem 2. Let s be the size of the set {k : ^ 0}. Then, we have 

||W* -Woolly 



A m in(5]) 

where = S _1 /x d /(/xJS _1 /x d ) is the normalized Fisher discriminant, optimizing with A = 0. 



5.2. Screening-based ROAD (S-ROAD) 

Following the idea of Sure Independence Screening in lFan and Lv f2008f ). we pre-screen all the features 
before hitting the ROAD. The advantage of this two-step procedure is that we have a control on the total 
number of features used i n the final classification rule. A popular met hod for independent feature selection 
is the two-sa mple t-test ( Tibshirani et al, 2002t iFan and Fan. 20081 ) , which is a specific case of marginal 



screening 



Fan and Lv (20081) . The sure screening property of such a method was demonstrated in 



Fan and Fan (20081 ). which selects consistently the features with different means in the same settings as 



ours. 



Once the features are selected, we can hit the ROAD, producing the vanilla Screening-based Regu- 
larized Optimal Affinc Discriminant (S-ROAD1): 
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(1) Employ a screening method to get k features. 

(2) Apply ROAD to the k selected features. 

In the first step, we use the t-statistics as the screening criteria and determine a d ata-driven threshold 



This idea is motivated by a FDR criterion for choosing marginal screening threshold in lZhao and Li (20101 ) . 
A random permutation tt of {1, ■ • • , n} is used to decouple and Yi so that the resulting data (Xj^j, Y{) 
follow a null model, by which we mean that features have no prediction power for the class label. More 
specifically, the screening step is carried out as follows: 

(i) Calculate the i-statistic tj for each feature j, where j = 1, ■ ■ ■ , p. 

(ii) For the permuted data pairs (X^^, Yi), recalculate the i-statistic t*, for j = 1, • • • ,p. (Intuitively, 
if j is the index of an important feature, \tj\ should be larger than most of \t*\, because the random 
permutation is meant to eliminate the prediction power of features.) 

(iii) For q € [0,1], let o>( ? ) be the q th quantile of = 1,2, ••• ,p}. Then, the selected set A is 
defined as 

A = {j\\t j \>u iq) }. 

The choice of threshold is made to retain the features whose i-statistics are significant in the two sample 
t-test. Alternatively, if the user knows his k, (due to budget constraints, etc.), then he can just rank |t, |'s 
and choose the threshold accordingly. 

The S-ROAD1 tracks the performance of oracle procedures like sub-Fisher (10 features) in Figure 
1. The feature space gotten by step (1) can be expanded by including those features which are most 
correlated with what have already been selected. This additional variant, S-ROAD2, aims at achieving 
the performance of sub-Fisher (20 features) type of procedure in Figure 1. 

To elaborate on the theoretical properties of S-ROAD1, assume with no loss of generality that the 
first k variables are selected in the screening step. Denote by the upper left k x k block of £ and fi k 
the first k coordinates of fi d . Let 

/3 C = argmin /3 T E fe /3. 

\\P\\i<cP T » k = l 



The quantities /3 C and (3^ are defined similarly to w c and w?' (defined right before Theorem 1). Then 
denote by y c = (/3^,0 T ) T , y c = (/3 C ,0 T ) T and = (w^P ,0 T ) T . The next two theorems can be 
verified along lines similar to Theorems 1 and 2. Hence, the proofs are omitted. 



Theorem 3. IfWEk-^kW^ = O p ( v /log fc/n), ||/i fc -/i fe || 00 = O p {^f\ogk/n), and A min (S i: ) > S > 0, 
then we have 

W(S y J-W(6 v J = O p (e n ), 

and 

W n (S y J-W(S yc ) = O p (e n ), 

where e„ = (c 2 V k)^J^^- . 

This result is cleaner than Theorem [TJ as the rate does not involve s c and s c : they are simply replaced 
by the upper bound k. Accurate bounds for s c and s c are of interest for future exploration, but they are 
beyond the scope of this paper. 
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Theorem 4. Let y\ = argmin J/ . /x T y=1 ye M k R{y) + -^llz/lli where Mk is the subspace in R p with the 
last p — k components being zero, and y° = ((S ; 7 1 /x fc ) T /(/i^S^ 1 /x fc ), T ) T . Then we have 

\\yt-y°h< 



Amin(Sfe) 



5.3. Continuous Piecewise Linear Solution Path 

We use the word "linear" when referring to "affinc" , in line with the status quo in the statistical com- 
munity. Continuous piecewise linear paths are of much interest to statisticians, as the property reduces 
the computational complexity of solutions and just ifies the linear inte rpol ations of solutions at d iscrete 



points. Previous well known investigations include lEfron et al. (20041 ) and lRosset and Zhu (20071 ). Our 



setup differs from others mainly in that in addition to a complexity penalty, there is also an affine con- 
straint. Our proof calls in point set topology, and is purely geometrical, in a spirit very different from 
the existing ones. In particular, we stress that the continuity property is intuitively correct, but it is far 
from a trivial consequence of the assumptions. The authors also believe that the claim holds true even if 
the p — 1 dimensional affinc subspace constraint is replaced by more generic ones, though the technicality 
of the proof must be more involved. 

Theorem 5. Let fi d G W be a constant, and S be a positive definite matrix of dimension p x p. Let 

w c = argmin w T Sw, 

||w||i<c,w T /i d = l 

then w c is a continuous piecewise linear function in c. 

Proposition 1. W{5 vfa ) is a Lipschitz function in c. 
PROOF. Recall that 

w{& Vc ) = i - $ (i/ork)) 1 / 2 ) . 

By Theorem [3 and the fact that composition of Lipschitz functions is again Lipschitz, the conclusion 
holds. 



6. Numerical Investigation 

In this section, several simulation and real data studies are conducted. We compare ROAD and its 
variants S-ROAD1 (Screening-based ROAD version 1), S-ROAD2 (Screening-based ROAD version 2) 
and D-ROAD (Diagonal ROAD) with NSC (Nearest Shrunken Centroid), SCRDA (Shrunken Centroids 
Regularized Discriminant Analysis), FAIR (Feature Annealed Independence Rule), NB (Naive Bayes), 
NFR (Naive Fisher Rule, which uses the generalized inverse of the sample covariance matrix), as well as 
the Oracle. 

In all simulation studies, the number of variables is p = 1000, and the sample size of the training and 
testing data is n = 300 for each class. Each simulation is repeated 100 times to test the stability of the 
method. Without loss of generality, the mean vector of the first class is set to be 0. We use five-fold 
cross-validation to choose the penalty parameter A. 
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Fig. 2. Solution Path for ROAD (left panel) and D-ROAD (right panel). Equal correlation setting (p = 0.5), Sparse 
Signal (s = 10) as in Section 101 




Table 1. Equal correlation setting, fixed signal: Median of the percentage for testing classification error and standard 
deviations (in parentheses). Signal all equal to 1 . s {) = 10. 



p 


ROAD 


S-ROAD1 


S-ROAD2 


D-ROAD 


SCRDA 


NSC 


FAIR 


NB 


Oracle 





6.0(1.2) 


6.0(1.1) 


6.0(1.2) 


5.7(1.1) 


6.3(1.0) 


5.9(1.0) 


5.7(1.0) 


11.2(1.4) 


5.5(1.1) 


0.1 


6.3(2.5) 


12.2(5.0) 


8.8(2.4) 


11.6(5.1) 


10.3(1.4) 


11.1(3.0) 


12.4(1.4) 


26.8(10.1) 


5.0(0.9) 


0.2 


5.3(1.0) 


16.0(6.3) 


8.7(2.5) 


16.1(7.5) 


8.5(1.2) 


14.5(4.3) 


17.3(1.7) 


34.8(11.6) 


4.0(0.8) 


0.3 


4.2(0.9) 


19.1(7.9) 


7.8(2.6) 


19.1(9.4) 


6.6(1.1) 


17.1(5.5) 


20.8(1.7) 


39.3(12.3) 


3.2(0.7) 


0.4 


3.2(0.8) 


22.8(9.4) 


6.5(2.6) 


22.2(9.9) 


4.8(1.0) 


20.5(6.1) 


23.2(1.8) 


41.6(11.3) 


2.0(0.6) 


0.5 


2.0(0.6) 


25.8(11.0) 


4.8(1.4) 


25.2(10.2) 


2.9(0.7) 


23.2(6.0) 


25.3(1.6) 


43.5(11.1) 


1.3(0.5) 


0.6 


1.0(0.4) 


18.3(12.4) 


3.3(1.3) 


28.1(10.3) 


1.5(0.5) 


25.8(5.7) 


26.8(1.8) 


44.4(12.1) 


0.7(0.3) 


0.7 


0.3(0.2) 


15.5(13.6) 


1.7(1.0) 


29.1(10.1) 


0.5(0.3) 


27.0(8.2) 


28.2(2.0) 


45.2(12.3) 


0.2(0.2) 


0.8 


0.0(0.1) 


5.0(14.0) 


0.3(0.4) 


29.5(9.9) 


0.0(0.1) 


28.3(8.7) 


29.2(2.0) 


46.2(10.3) 


0.0(0.1) 


0.9 


0.0(0.0) 


0.6(14.8) 


0.0(0.1) 


30.3(7.6) 


0.0(0.2) 


29.9(8.0) 


30.2(1.9) 


46.8(8.8) 


0.0(0.0) 



6. 1. Equal Correlation Setting, Sparse Fixed Signal 

In this subsection, we consider the setting where = 1 for all i = 1, • ■ ■ , p and E^j = p for all 
i, j = 1, • ■ ■ ,p and i ^ j, and take fi 2 to be a sparse vector: fi 2 = (3-xo) Oggo) T i where 1^ is a length d 
vector with all entries 1, 0^ is a length d vector with all entries 0, where the sparsity size is so = 10. 
Also, we fix 7 = 10 in (fT4")l for this simulation. Sensitivity of the performance due to the choice of 7 will 
be investigated in the next subsection. 

The solution paths for ROAD and D-ROAD of one realization are rendered in Figure [5J It is clear 
from the figure that, as the penalty parameter decreases (index increases), both ROAD and D-ROAD 
use more features. Also, the cutoff point for D-ROAD, where the number of features starts to increase 
dramatically, tends to come later than that for ROAD. 

The simulation results for the pairwise correlations ranging from to 0.9 are shown in Tables [T] and [2] 
We would like to mention that the results for NFR (Naive Fisher Rule) are not included in these (and the 
subsequent) tables because the test classification error is always around 50%, i.e., it is about the same as 
random guess. Also in the tables are the screening-based versions of the ROAD. S-ROAD1 refers to the 
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Table 2. Equal correlation setting, fixed signal: Median of number of nonzero coefficients and standard deviations (in 
parentheses). Signal all equal to 1 . s = 10. 



p 


ROAD 


S-ROAD1 


S-ROAD2 


D-ROAD 


SCRDA 


NSC 


FAIR 





16.00(24.16) 


10.00(1.31) 


17.00(4.31) 


29.50(58.54) 


10.00(13.25) 


10.00(44.86) 


11.00(1.62) 


0.1 


117.50(30.50) 


11.00(3.32) 


21.00(4.15) 


14.00(122.02) 


1000.00(345.48) 


35.50(117.32) 


10.00(0.27) 


0.2 


130.50(33.33) 


11.00(6.99) 


22.00(8.98) 


15.50(111.42) 


1000.00(0.00) 


95.00(120.17) 


10.00(0.69) 


0.3 


136.50(36.16) 


11.00(11.56) 


22.00(10.38) 


17.50(106.16) 


1000.00(0.00) 


103.50(117.52) 


9.00(1.19) 


0.4 


135.00(34.43) 


10.00(14.21) 


22.00(17.07) 


10.00(98.10) 


1000.00(0.00) 


70.00(131.65) 


8.00(1.33) 


0.5 


138.50(38.17) 


9.00(21.71) 


22.00(21.56) 


10.00(105.33) 


1000.00(0.00) 


65.00(137.97) 


7.00(1.30) 


0.6 


148.00(49.74) 


10.50(27.92) 


22.00(31.88) 


10.00(110.23) 


1000.00(0.00) 


38.00(141.91) 


6.00(1.30) 


0.7 


170.50(52.29) 


11.00(37.37) 


22.00(41.76) 


1.00(118.43) 


1000.00(0.00) 


27.50(140.10) 


5.00(1.20) 


0.8 


203.00(27.72) 


12.00(50.36) 


24.00(59.23) 


1.00(143.83) 


1000.00(10.92) 


15.00(157.98) 


5.00(1.29) 


0.9 


151.50(8.02) 


14.00(55.32) 


28.00(50.45) 


1.00(153.27) 


1000.00(56.30) 


14.00(225.38) 


3.00(1.08) 



vanilla version where we first apply the two-sample i-test to select any features with the corresponding 
t-test statistic with absolute value larger than the maximum absolute t-test statistic value calculated on 
the permuted data. S-ROAD2 does the same except for each variable in S-ROADl's pre-screened set, it 
adds an additional variable which is most correlated with that variable. Figure [3J a graphical summary 
of Table 1, presents the median test errors for different methods. We can see from Tabled] and Figure |3] 
that the oracle classification error decreases as p increases. This phenomenon is due to a similar reason 
to the two-dimensional showcase in the introduction. When p goes to 1, all the variables contribute in 
the same way to boost the classification power. ROAD performs reasonably close to the Oracle, while 
working independence based method such as D-ROAD, NSC, FAIR and NB fail when p is large. The 
huge discrepancy shows the advantage of employing the correlation structure. Since SCRDA also employ 
the correlation structure, it does not fail when p is large. However, ROAD still outperforms SCRDA in 
all the correlation settings. S-ROAD1 and S-ROAD2 both have misclassification rates similar to that 
of ROAD. It is worth to emphasize that the merits of the screening based ROADs mainly lie in the 
computation cost, which is reduced significantly by the pre-screening step. 

The ROAD is a very robust estimator. It performs well even when all the variables are independent, 
in which case there could be a lot of noise for fitting the covariance matrix. Table Q] indicates that ROAD 
has almost the same performance as D-ROAD, NSC and FAIR under the independence assumption, 
i.e. p = 0. As p increases, the edge of ROAD becomes more substantial. In general, the ROAD is 
recommended on the grounds that even with pairwise correlation of about 0.1 (which is quite common 
in microarray data as well as financial data), the gain is substantial. 

Another interesting observation is that the D-ROAD performs similarly to NSC and FAIR in terms 
of classification error. An intuitive explanation is that they are all "sparse" independence rules. NSC 
uses soft-thresh olding on the standard ized sample mean difference, and its equivalent LASSO derivation 
can be found in IWang and Zhu (20071 ) . FAIR selects features with large marginal t-statistics in absolute 
values, while D-ROAD is another LI penalized independence rule, whose implementation is different from 
NSC. 

Table 2 summarizes the number of features selected by different classifiers. Note that ROAD mimics 
Fisher discriminant coordinate S _1 /x rf , which has p = 1000 nonzero entries under our simulated model. 
Therefore, the large number of features selected by ROAD is not out of expectation. 
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Fig. 3. Median classification error as a function of p in the equi-correlation matrix. Sparse n d as in Section lPl 




6.2. The Effect of 7 

Under the settings of the previous subsection, we look into the variation of the ROAD performance as 
7 changes. In Table [H the number of active variables varies; however, the median classification error 
remains about the same for a broad range of 7 values. The reason is that the cross validation step chooses 
the "best" A according to a specific 7. Therefore, the final performance remains almost unchanged. Since 
our primary concern is the classification error, we fix 7 = 10 for simplicity in the subsequent simulations 
and in the real data analysis. 



6.3. Block Diagonal Correlation Setting, Sparse Fixed Signal 

In this subsection, we follow the same setup as in Section 16.11 except that the covariance matrix X is 
taken to be block diagonal. The first block is a 20 x 20 equi-correlated matrix and the second block is 
a (p — 20) x (p — 20) equi-correlated matrix, both with pairwisc correlation p. In other words, Ej^ = 1 
for all i = 1, • • ■ ,p, Sjj = p for all i,j = 1, ■ ■ ■ ,20 and i 7^ j, Sjj = p for all i, j = 21, ■ • • ,p and i 7^ j, 
and the rest elements are zeros. As before, we examine the performances of various estimators when p 
varies. The percentage for testing error and the number of selected features in the estimators are shown 
in Tables [4] and [5j respectively. 

In this block-diagonal setting, we have observed similar results to those in Section |6~T1 ROAD and 
S-ROAD2 perform significantly better than the other methods. One interesting phenomenon is that S- 
ROAD1 does not perform well when p is large. The reason is that the current true model has 20 important 
features, and by looking only at marginal contribution, S-ROAD1 misses some important variables, as 
shown in Table 2J Indeed, because those features have no expressed mean differences, it does not fully 
take advantage of highly correlated features. In contrast, S-ROAD2 is able to pick up all the important 
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Table 3. Equal correlation setting; signals all equal to 1 ; s = 10. Results for 
different 7. 









P = 


p = 0.5 


p = 0.9 


Median 
classification 
error (in 
percentage) 


R0AD 7 . 
R0AD 7= 
R0AD 7 . 


=0.01 

=0.1 

=1 


5.8(1.2) 
6.0(1.2) 
6.0(1.3) 


2.7(0.6) 
2.0(0.6) 
2.0(0.6) 


0.2(0.2) 
0.2(0.1) 
0.0(0.1) 


R0AD 7 . 
ROAD 7= 


=10 
=100 


6.0(1.2) 
6.2(1.2) 


2.0(0.6) 
2.3(0.6) 


0.0(0.0) 
0.0(0.1) 








p = 


p = 0.5 


p = 0.9 




R0AD 7= 


=0.01 


14.0(19.2) 


129.5(42.5) 


657.0(179.6) 


Median 


R0AD 7 . 


=0.1 


14.0(19.6) 


137.0(37.6) 


773.5(103.2) 


number of 


R0AD 7 . 


=1 


16.5(22.9) 


139.0(37.9) 


514.0(39.7) 


nonzeros 


R0AD 7= 


=10 


16.0(24.2) 


138.5(38.2) 


151.5(8.0) 




ROAD 7= 


=100 


22.0(16.1) 


114.5(9.4) 


94.0(9.6) 



Table 4. Block diagonal correlation setting, sparse fixed signal: Median of the percentage for testing classification 



error and standard deviations (in parentheses). Signal all equal to 1 . s = 10. 



p 


ROAD 


S-ROAD1 


S-ROAD2 


D-ROAD 


SCRDA 


NSC 


FAIR 


NB 


Oracle 





6.0(1.2) 


6.0(1.1) 


6.0(1.2) 


5.7(1.1) 


6.0(0.1) 


5.5(0.3) 


5.7(1.0) 


11.2(1.4) 


5.5(1.1) 


0.1 


10.8(3.6) 


13.0(4.8) 


10.3(3.0) 


12.8(4.4) 


13.0(0.3) 


12.5(0.8) 


12.7(1.5) 


25.7(7.6) 


8.8(1.2) 


0.2 


10.7(4.1) 


18.0(5.7) 


9.7(3.6) 


17.7(5.9) 


14.2(1.1) 


17.2(0.4) 


17.7(1.6) 


34.4(7.9) 


8.8(1.2) 


0.3 


9.5(3.8) 


23.2(5.5) 


8.8(4.0) 


23.2(5.6) 


12.7(0.9) 


20.0(0.8) 


20.4(1.6) 


38.3(7.5) 


7.7(1.0) 


0.4 


8.0(3.3) 


29.7(4.2) 


7.5(4.2) 


29.3(4.1) 


11.0(1.2) 


23.8(1.3) 


23.2(1.8) 


41.0(6.9) 


6.6(1.1) 


0.5 


6.2(2.6) 


30.1(3.9) 


5.7(0.9) 


30.0(3.1) 


8.7(0.4) 


26.2(1.7) 


25.1(1.7) 


42.2(6.6) 


5.0(1.0) 


0.6 


4.2(0.9) 


30.3(4.2) 


4.0(0.8) 


30.3(2.2) 


6.4(0.1) 


26.5(1.2) 


26.8(1.8) 


43.6(7.0) 


3.5(0.7) 


0.7 


2.3(0.7) 


30.0(6.4) 


2.2(0.7) 


30.6(2.1) 


2.5(0.7) 


28.1(3.2) 


28.2(2.0) 


44.2(6.5) 


1.8(0.6) 


0.8 


0.8(0.4) 


29.8(9.8) 


0.7(0.4) 


30.6(2.1) 


0.6(0.4) 


29.2(1.6) 


29.2(2.0) 


44.8(5.7) 


0.7(0.3) 


0.9 


0.0(0.1) 


29.8(12.8) 


0.0(0.1) 


30.6(1.9) 


0.2(0.2) 


29.2(1.2) 


30.2(1.9) 


45.2(4.9) 


0.0(0.1) 



Table 5. Block diagonal correlation setting, fixed signal: Median of number of nonzero coefficients and standard 
deviations (in parentheses). Signal all equal to 1 . s = 10. 



p 


ROAD 


S-ROAD1 


S-ROAD2 


D-ROAD 


SCRDA 


NSC 


FAIR 





16.00(24.16) 


10.00(1.31) 


17.00(4.31) 


29.50(58.54) 


10.00(1.15) 


10.00(1.73) 


11.00(1.62) 


0.1 


48.50(35.99) 


10.00(2.73) 


20.00(3.77) 


14.00(26.73) 


33.00(17.79) 


65.00(38.84) 


18.00(2.67) 


0.2 


48.00(31.48) 


10.00(4.59) 


20.00(5.84) 


10.00(18.23) 


38.00(117.54) 


10.00(16.17) 


18.00(2.77) 


0.3 


47.50(42.75) 


9.00(5.28) 


20.00(6.03) 


10.00(11.80) 


208.00(103.94) 


10.00(13.58) 


18.00(3.91) 


0.4 


40.50(32.42) 


1.00(4.82) 


20.00(10.08) 


1.00(9.25) 


27.00(90.95) 


33.00(14.22) 


17.00(5.43) 


0.5 


40.50(33.23) 


1.00(4.88) 


20.00(10.10) 


1.00(8.51) 


24.00(76.79) 


10.00(1.15) 


7.00(5.98) 


0.6 


39.50(30.03) 


1.00(3.74) 


20.00(14.53) 


1.00(5.92) 


127.50(6.36) 


6.50(2.12) 


6.00(5.98) 


0.7 


40.00(41.35) 


1.00(4.71) 


20.00(8.07) 


1.00(2.49) 


94.50(2.12) 


9.50(0.71) 


5.00(5.52) 


0.8 


55.00(58.67) 


1.00(6.20) 


20.00(18.32) 


1.00(0.93) 


58.00(2.83) 


6.00(5.66) 


5.00(4.84) 


0.9 


120.00(30.66) 


1.00(21.29) 


20.00(30.46) 


1.00(0.35) 


20.00(0.00) 


8.00(2.83) 


3.00(3.81) 
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Table 6. Block-Diagonal Negative Correlation Setting, Sparse Fixed Signal: Median error (in percentage) and 
number of nonzero coefficients with standard deviations in parentheses. 





ROAD S-ROAD1 S-ROAD2 D-ROAD SCRDA 


NSC 


FAIR 


NB 


Oracle 


error 
nonzero 


7.3(3.4) 16.0(5.2) 12.7(3.4) 17.8(8.0) 18.5(1.1) 
168.00(47.59) 10.00(2.40) 20.00(3.58) 15.50(15.32) 24.00(0.58) 


20.8(0.6) 
41.00(17.90) 


24.8(2.1) 
59.00(4.27) 


33.5(2.1) 


3.2(0.7) 


Table 7. Random correlation setting, double exponential signal: Median error I 
nonzero coefficients with standard deviations in parentheses. 


In percentage) and number of 




ROAD S-ROAD1 S-ROAD2 D-ROAD SCRDA 


NSC 


FAIR 


NB 


Oracle 


error 
nonzero 


2.0(0.6) 11.0(5.2) 5.8(3.9) 17.0(2.2) 5.2(1.1) 
83.00(39.54) 4.00(8.13) 9.00(10.69) 1.00(3.89) 1000.00(0.00) 


16.2(1.3) 
4.00(0.58) 


17.0(1.6) 
1.00(0.17) 


46.2(2.4) 


1.3(0.5) 



variables, takes advantage of correlation structure, and leads to a sparser model than the vanilla ROAD. 
In view of the results from this simulation setting and the previous one, we recommend S-ROAD2 over 
S-ROAD1. 

6.4. Block-Diagonal Negative Correlation Setting, Sparse Fixed Signal 

In this subsection, we again follow a similar setup as in Section 16.11 Here, the covariance matrix S is 
taken to be block diagonal with each block size equals to 10. Each block is an equi-correlated matrix 
with pairwise correlation p = —0.1. In other words, S = diag(So, ■ ■ • , So), where So is a 10 x 10 
equi-correlated matrix with correlation —0.1. Here, /j, 2 = 0.5 x (ljT, 0|\ lj , 0|g 5 ) T and the sparsity size 
is so = 10. As before, we examine the performances of various estimators when p varies. The percentage 
for testing error and the number of selected features in the estimators are shown in Table El 

6.5. Random Correlation Setting, Double Exponential Signal 

To evaluate the stability of the ROAD, we take a random matrix S as the correlation structure, and 
use a signal /j, whose nonzero entries come from a double exponential distribution. A random covariance 
matrix S is generated as follows: 

(i) For a given integer m (here we take m = 10), generate a p x m matrix ft where f2,j ~ Unif(— 1, 1). 
Then the matrix f2f2 T is positive semi-definite. 

(ii) Denote Co = niini(nf2 T )^. Let S = Jlf2 T + Cfjl, where I is the identity matrix. It is clear that S 
is positive definite. 

(iii) Normalize the matrix S to get S whose diagonal elements are unity. 

For the signal, we take fi to be a sparse vector with sparsity size s = 10, and the nonzero elements are 
generated from the double exponential distribution with density function 

/(x) = exp(-2|z|). 

Table [7] summaries the results. It shows that even under random correlation setting and random 
signals, our procedure ROAD still outperforms other competing classification rules such as SCRDA, NSC 
and FAIR in terms of the classification error. 

6.6. Real Data 

Though the ROAD seems to perform best in a broad spectrum of idealized experiments, it has to be 
tested against reality. We now evaluate the performance of our newly proposed estimator on three popular 
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Table 8. Classification error and number of selected genes by various methods of leukemia data. 
Training and testing samples are of sizes 38 and 34, respectively. 





ROAD 


S-ROAD1 


S-ROAD2 


SCRDA 


FAIR 


NSC 


NB 


Training Error 











1 


1 


1 





Testing Error 


1 


3 


1 


2 


1 


3 


5 


No. of selected genes 


40 


49 


66 


264 


11 


24 


7129 



Table 9. Classification error and number of selected genes by various methods of lung cancer 
data. Training and testing samples are of sizes 32 and 149, respectively. 





ROAD 


S-ROAD1 


S-ROAD2 


SCRDA 


FAIR 


NSC 


NB 


Training Error 


1 


1 


1 











6 


Testing Error 


1 


4 


1 


3 


7 


10 


36 


No. of selected genes 


52 


56 


54 


2410 


31 


38 


12533 



gene expression data sets: "Leukemia" (IGolub et pi., 1999lh "Lung Cancer" (jGordon et gi, 20021 ). and 
"Neuroblastoma data set" (jOberthuer et ai. 2006T) . The first two data sets come with predetermined, 
separate training and test sets of data vectors. The Leukemia data set contains p = 7, 129 genes for 
ri\ = 27 acute lymphoblastic leukemia (ALL) and ?i2 = H acute myeloid leukemia (AML) vectors in the 
training set. The test set includes 20 ALL and 14 AML vectors. The Lung Cancer data set contains 
p = 12, 533 genes for n\ = 16 adenocarcinoma (ADCA) and ri2 = 16 mesothelioma training vectors, 
along with 134 ADCA and 15 mesothelioma test vectors. The Neuroblastoma data set, obtained via 
the MicroArray Quality Control phase-II (MAQC-II) project, consists of gene expression profiles for 
p = 10, 707 genes from 251 patients of the German Neuroblastoma Trials NB90-NB2004, diagnosed 
between 1989 and 2004. We analyzed the gene expression data with the 3-year event-free survival (3-year 
EFS), which indicates whether a patient survived 3 years after the diagnosis of neuroblastoma. There are 
239 subjects with the 3-year EFS information available (49 positives and 190 negatives). We randomly 
select 83 subjects (19 positives and 64 negatives, which are about one third of the total subjects) as the 
training set and the rest as the test set. The readers can find more details about the data sets in the 
original papers. 

Following iDudoit et al. (20021 ) and iFan and Fan (2008T ). we standardized each sample to zero mean 
and unit variance. The classification results for ROAD, S-ROAD1, S-ROAD2, SCRDA, FAIR, NSC and 
NB are shown in Tables [SJ and [TJJJ For the leukemia and lung cancer data, ROAD performs the best 
in terms of classification error. For the neuroblastoma data, NB performs best, however, it makes use 
of all 10,707 genes, which is not very desirable. In contrast, ROAD has a competitive performance in 
terms of classification error and it only selects 33 genes. Although SCRDA has a close performance, the 
number of selected variables varies a lot for the three data set (264, 2410, 1). Overall, ROAD is a robust 
classification tool for high-dimensional data. 



Table 1 0. Classification error and number of selected genes by various methods of neuroblastoma 
data. Training and testing samples are of sizes 83 and 163, respectively. 





ROAD 


S-ROAD1 


S-ROAD2 


SCRDA 


FAIR 


NSC 


NB 


Training Error 


3 


22 


14 


16 


15 


16 


14 


Testing Error 


33 


47 


.37 


37 


44 


35 


32 


No. of selected genes 


33 


1 


9 


1 


18 


41 


10707 
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7. Discussion 

With a simple two-class gaussian model, we explored the bright side of using correlation structure for high 
dimensional classification. Targeting directly on the classification error, ROAD employs un-regularized 
pooled sample covariance matrix and sample mean difference vector without suffering from curse of 
dimensionality and noise accumulation. The sparsity of chosen features is evident in simulations and 
real data analysis; however, we have not discovered intuitively good conditions on X and fi d , such that a 
certain desirable sparsity pattern of w c follows. We resolve a part of the problem by introducing screening- 
based variants of ROAD, but the precise control of the sparsity size is worth for further investigation. 
Furthermore, we can explore the conditions for the model selection consistency. 

In this paper, we have restricted ourselves to the linear rules. They can be easily extended to nonlinear 
discriminants via transformations such as low order polynomials or spline basis func tions. One may also 



use the popular "kernel tricks" in the machine learning community. See, for example. lHastie et al. f2009T ) 
for more details. After the features are transformed, we can hit the ROAD. One essential technical 
challenge of the current paper is rooted in a stochastic linear constraint . The precise role of this constraint 
has not been completely pinned down. In the following, a preliminary proposal is provided for extending 
ROAD to multi-class settings. 



7. 1. Extension to Multi-Class 

In this section, we outline an extension of ROAD to multi-class classification problems. Suppose that 
there are K classes, and for j = 1, • • • , K , the jth class has mean ■ and common covariance S. Denote 
the overall mean of features by /J, a = K^ 1 Y^j—i V-y Fisher's reduced rank approach to multi-class 
classification is a minimum distance classifier in some lower dimensional projection space. The first step 
is to find s < K—l discriminant coordinates (wj, • • • , w*) that separate the population centroids 
the most in the projected space S = span{w|;, • ■ ■ , w*}. Then the population centroids /Liy's and new 
observation X are both projected onto S. The observation X will be assigned to the class whose projected 
centroid is closest to the projection of X onto S. Note that it is usually not necessary to compute all 
K—l discriminant coordinates whose span is that of all K population centroids; the process can stop as 
long as the projected population centroids are well spread out in S. 

We adopt the above procedure for multi-class classification. However, the large-p-small-n scenario 
demands regularization in selecting discriminant coordinates. Indeed, in the Fisher's proposal the first 
discriminant coordinate is the solution of 

w'Bw 



(18) 



where B = , i' T \E', and the jth column of * T is (^i ■ — fx a ). Note that a multiple of B is the between- 
class variance matrix. The second discriminant coordinate is the maximizer of w T Bw/(w T Sw) with 
constraint wJ T Sw = 0, and the subsequent discriminant coordinates are determined analogously. 

Since solving (TT8"|) is the same as looking for the eigenvector of S~ 1 / 2 BS -1 ^ 2 corresponding to the 
largest eigenvalue, diverging spectrum and noise accumulation have to be considered when we work on 
the sample. To address these issues, we regularize w as in the binary case, 

min w T Sw, (19) 

||w||i<c,w T Bw=l 

whose solution is t he first regularized discri minant coordinate . Here, equation (|19[) is related to the null 
space method in (jKrzanowski et al. 19951) . The second regularized discriminant coordinate is obtained 
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by solving (|19|) with additional constraint wJ r Sw = 0. Other regularized discriminant coordinates can 
be found similarly. With these s (< K—l) regularized discriminant coordinates, the classifier is now based 
on the minimum distance to the projected centroids in the s-dimensional space spanned by {w*}| =1 . 

The implementation and theoretical properties for multi-class ROAD arc interesting topics for future 
research. 
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A. Proofs 

A. 1 . Proof of TheoremUl 

We now show first part of the theorem. Let /o(w) = w T /x d /(w T Sw) 1 / 2 , /i(w) = w T /i d /(w T Sw) 1 / 2 , 
and /2(w) = w T /i c; /(w T Sw) 1 / 2 . Then, it follows easily that 

|/o(w c ) - / 2 (w c )| < Ai + A 2 , 

where Ai = |/o(w c ) — /i(we^)| and A 2 = |/i(wc — /2(w c )|. We now bound both terms separately in 
the following two steps. 

Step 1 (bound Ai): For any w, we have 

|/o(w) - /i(w)| < 



< 



(W T SW)V2 (w^Sw) 1 ^ 1 
l w IUIIAd — Mdlloo 



|w|| 2 A^ 2 n (E) 



< a/II w IIo 



CTO 



= v4RRO p (a„). (20) 

Since w^ 1 ' maximizes /i(-), it follows that 

/o(w c ) - /i(wW) = / (w c ) - A(w c ) + [A(w c ) - /i(w^)] 

</o(wc)-/i(w c ), (21) 

and similarly noticing w c maximizing fo(-), we have 

/i(w< 1} ) - /o(w c ) = /i(w«) - / (w«) + [/ (w«) - / (w c )] 

</i(w«)-/o(w( 1 )). (22) 

Combining the results of (|2"Tj) and (f2"2")l and using (j2"0")l . we conclude that 

Ai = |/ (w c ) - /i(w«)| = O p ((s c V 4 1 ))a„) . 
By the Lipschitz property of $, 

|<fr(/i(w«)) - $(/ (w c ))| = O p ((s c V s^)a n ) . 
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Step 2(bound A2): Note that wf and w c both are in the set {w : w T /x d = 1, ||w||i < 1}. Therefore, 
by definition of minimizers, we have 

wt 1)T SwW - WjSW c < 0, and SW C - w'^Sw' 1 ' < 0. 



Consequently, 



< w^(S - S)w c 

< HE-SIUIwcll? 

< c 2 ||S - 

= P (a„c 2 ). (23) 

By the same argument, wc also have 

^±w c - w« T £w« = [wjtw c - w^SwW] + w^SwW - wW'SwW 

< w« T (£ - S)wi 1 ' 

< c 2 ||S - SHoo 

= O p (a n c 2 ). (24) 
Combination of ((23|) and (f24|) leads to 

|wJSW c - w'^Sw^l = O p (a n c 2 " 



Let <?(x) = $(x x / 2 ). The function g is Lipschitz on (0, 00), as g'(x) is bounded on (0,oo). Hence, 
|*(/a(w c )) - ^(M^ ( c ] ))\ = O p (a n c 2 ). Thus, 

|W n (5w c ,0) - W(5 Wc ,0)\ < |#(/ 2 (w c )) - $(/ (wW))| + |$(/i(wf))) - $(/ (w c ))| 

= O p ((s c V s^H) + O p (a n c 2 ) 
= O p (b n ). 

We now prove the second result of the Theorem. Since |w^Sw c — w^Sw c | = O p (a n c 2 ), we have 

|*(/i(w c )) - *(/ 3 (w c ))| = O p (a n c 2 ). (25) 

By (|2T))) . (|25D . and the first part of the Theorem, we have 

\W(S Wc ,6)-W{S Wc ,0)\ 
= l<f(/o(w c )) - $(/o(w c ))| 

<|*(/o(w c )) - $(/i(w c ))| + |$(/i(w c )) - $(/ 2 (w c ))| + |*(/ 2 (w c )) - $(/o(w c ))| 

=O p (s c a n ) + O p (a n c 2 ) + O p (b n ) 

=O p (d n ). 



This completes the proof of Theorem. 
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A .2. Proof of Theorem H 

Let w A = Wqo + 7 A ■ Then, from the definition of w A , we have 

7 A = argmin R(woo + 7) + AHw^ + 7H1 

= argmin 7(7), (26) 

where f(j) = Rij) + XT,keK"\7k\ + ^keK (l w o© + 7fcl ~ l w ool)- I n the last statement, we used the fact 
that 

w£,£ 7 = ^/(/xjE-Vd) = 0. 

We write 7 for 7 A for short in what follows. 

By (gnj), we have f{j) < /(0) = 0. This implies that 

R(j) < AE fceX (|w^|-|w^+7 fc |) < \X keK \~f k \ <A^|| 7 ||2. 

On the other hand, R(j) > A m i n (S)||7|||. Bringing the upper and lower bound of R(*y) together, we 
conclude that 

Il7l| 2 < 



A m in(5j) 

The proof is now complete. 



A . 3. Proof of Theorem \B\ 

By the positive definiteness of S, X -1 and X" are also positive definite. Let then the 

transformation v 1— > w is linear. Define 

v c = argmin v T v, 

\\-£.- 1 ' 2 v\\ 1 <c.v T p, d = l 

1/9 

where n d = X ' fi d . It is enough to show that v c is piecewise linear in c. 

Let O c = {v : \\H~ 1 ' 2 v\\i < c} and S = {v : v T jl d = 1}. When c is small, the solution set is 0; when 
c is large, the constraint il c is inactive. Denote by "a" the smallest "c" such that il c f] S 7^ 0, and by "b" 
the smallest such that v c are the same for all c > b. Hence we are interested in c G [a, 6], when changes 
in c actually affects the solution. 

Let P be the projection of the origin O onto the hyperplane S in the p dimensional space. Let 

77 _ rcO cO . cl cl . qp-T- cP- 1 \ 

where 5* c denotes an i-dimensional face of r2 c , i.e., 5*°,, represents a vertex, iSj- c an edge, and S^ 1 a 
facet. It is clear that T c is a finite set. 

Define a mapping 93 : [a,d]->ZxZ, where <p(c) = such that i) v c £ S 1 * c and ii) i is minimal. 
By definition, this mapping is single valued. 

For any cq £ (a, &], denote D Co = {(i,j)\We > 0,3c £ [c — e, cq) s.t. 93(c) = (i,j)}. The set D Co is 
non-empty because the collection G Z x Z|5j c G J" c } is finite. Then the theorem follows from 

compactness of [a,b] and Lemma [21 Remark 0] and Lemma [3] 

Lemma 1. Vc £ (a,b], 3e > suc/i thatV(i,j) £ D Co andVcG (c -e,c ), G Sjj° c r\S, where Pj 
is the projection of P onto S(~) 5* c , and c denotes the i- dimensional affine space in which S 1 * c embeds, 
and Sj° c is the interior of c , where the topology is the natural subspace topology restricted to <Sj c . 
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Proof. Fix Co G (a, b]. For any G D Co and e > 0, by the definition of D Cfn there exists 

c' G [co — e, Co) such that ip(d) = The minimality of i in the definition for tp implies that v c i = 

Pj ;C , G Sj° c ,, which in the interior of S' l j c ,. Therefore, P] d G 5j° c , n 5. By arbitrariness of e, 3(c„) / c 
such that P\ „ G n S for all n. 

It can also be shown that {c|P? G S)° n 5} is connected: let P? . G 5f , n S, P) r , G n S, 
c[ < c f 2 . For any c f 3 G (c^c^), / is on the line segment with endpoints P % - , and P 2 , because S\ 

J j 3 J ! i J ) -"2 J' 

arc parallel affine subspace in R p . Let SJ )COne := Uc^o-Sj^, then it is a cone. Since Pj c , G S % ^ cone and 
^ e ^.cone, we have Pj c , G S* iCone . Then, P* c , G Sj, cone n Sn S) A = S*° , n 5. Hence, 3 e<J - > such 
that for all c G [co — ey, Co), Pj jC G Sj° c . Take e = min^ ,j)£D C(] e ij, the claim follows. 

Lemma 2. Vco G (a, b], D Co is a singleton, and Be' > s«c/i that v c is linear in c on (co — e', Co). 

Proof. Fix co G (a, 6] . We claim that for some (i, j) G D CQ , there exists positive e'(< e that validates Lemma 1) 
such that for any c G (co — e',co), u c = P/ c - Assume that the claim is not correct, then pick any 
G D CQ , there exists a sequence {ck} (c& ^ cy if fc ^ fc') converging to Co from the left s.t. v Ck ^ PJj Cfc . 
Without loss of generality, take {c^} C (c — e, Co). Lemma 1 implies that P? G <Sj° Cj , n 5. If v Cfc G , 
we would have v Ck = Pj Cfc - Hence v Ck £ 5j Cfc . By finiteness of the index pairs in ,F C , there exists 
7^ such that (p(c) = (i',f) for c G {c^,}, where {c^,} is some subsequence of {ck}- This 

implies G -D Co , which together with Lemma 1 implies v c = Pp c for c G {cj;,}. Therefore 

iip;', c -^i|2<!|p;, c -pii 2 

for c G {cfc,}. 

On the other hand, because (i,j) G D Co , there exist infinitely many c' G (co — e 7 co) such that 
\\P}', 1& -Ph> ||P;, C , - P|| 2 . Therefore, 

9(c) = ||P - PjJ\i - ||P - Pj'jli 

changes signs infinitely many times on (co — e, Co). This leads to a contradiction because P* c and Pp c 
are both linear functions of c. Hence, the conclusion holds. 

To show that D CQ is a singleton, suppose it has two distinct elements and (i',j'). We have shown 
that v c = Pj c and v c = Pp c for all c in a left neighborhood of Co (not including Co). Also we have 
Pp c G Sf c and P], c G S),° c by Lemma 1. This can be true only when Sf c C Sj',° c (or vice versa), but 
then i < i , contradicting with minimality in definition of D Co . 

Remark 4. Similarly, Vco G [a, &), 3e' > such that v c is linear in c on (co, Co + e'). 

Lemma 3. v c is a continuous junction of c on [a,b]. 

Proof. The continuity follows from two parts i) and ii). 

i) Vc G [a, 6), Be > such that v c is continuous on [c , c + e). Indeed, let 

/i(c) = min v T v. 

||S-it>||i<c,t>' r f r <i =i 

We know that the mapping c h-> v c (= P] c ) is linear and hence continuous on (co, Co + e) for some small 
e > 0. It only remains to show that the mapping is right continuous at Co- Notice here h(c) = \\Pj c \\2 f° r 
c G (c , c + e). Let L = lim c | Co P? . It is clear that L G Sp CQ . Because L G fl Co H S, h(c ) < \\L\\2- This 
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inequality has to take the equal sign because h(-) is monotone decreasing, and h(c) = HPJJH — > 1 1 i- 1 1 § as 
c approaches Co from the right. Because v Co is unique, v C(t = L = lim c ^ Cn P!j c = lim c ^ Co v c . 

ii) Vco G (a, b], 3e > such that v c is continuous on (co — e, Co]. Again, it remains to show that there 
is no jump at cq. Let (i Co , jc ) — l f( c o)- Clearly Pj c ° Cq g S"* 00 CQ . Introduce a notion of parallelism of 

afhne subspaces in W p . We denote 5^° jC \\ S, if only by translation, £^° jC becomes a subset of S (or vice 

versa in other situations) ; use the notation Sj C ° c l/f S otherwise. 

IiS) a ° r > 5, for c in some left neighborhood of c , P- c ° r exists and P- c ° r g sS° Note P* c ° r g fi c nS", 
and ||-P, Zc ° c ||2 - > || ^* c ° c ||2 as c approaches Co from the left. Since h(-) is monotone decreasing, obviously 

J CQ •■ J C 1 

/i(c) — > \\Pjl° co HI = M c o)- This shows the left continuity of at Co- Suppose D Co = then we 

know on a left neighborhood of cq (not including cq), f c = P/ iC - Let £7 = lim c -|- Co P* c , then E g f2 Co n S 1 . 
Note that ||P Jc ° || 2 > ||P,U| 2 for all c in c 's left neighborhood, so we have ||P* C ° c || 2 > ||P|| 2 . On the 
other hand, ||P* c ° Co j| 2 < ||P||2 by the definition of Pf° co - Also, consider the uniqueness of distance 
minimizing point in O co fl S to origin O, E = P* a ° Cq , and hence v c has left continuity at cq. 

If S]" || S, 3Q g ^cn-e/2 H 5 such that Q ^ P* c ° . When c goes from c — e/2 to Co, there exists 
a point Q c g £l c n S moving on the line segment from Q to P*°° Cq . Therefore, /i(-) is left continuous at 
Co . Replace Pjj° c by Q c in the previous paragraph, the left continuity of v c at cq follows from the same 
argument. 
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